1. Gene Expression Matrices

Level | Quantifier | Metric | FilePath
Transcript | RSEM | Raw counts | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_transcriptLevel.RSEM_Count.txt
Transcript | RSEM | TPM | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_transcriptLevel.RSEM_TPM.txt
Transcript | RSEM | FPKM | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_transcriptLevel.RSEM_FPKM.txt
Transcript | Salmon | Raw counts | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_transcriptLevel.Salmon_Count.txt
Transcript | Salmon | TPM | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_transcriptLevel.Salmon_TPM.txt
Gene | RSEM | Raw counts | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_geneLevel.RSEM_Count.txt
Gene | RSEM | TPM | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_geneLevel.RSEM_TPM.txt
Gene | RSEM | FPKM | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_geneLevel.RSEM_FPKM.txt
Gene | Salmon | Raw counts | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_geneLevel.Salmon_Count.txt
Gene | Salmon | TPM | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_geneLevel.Salmon_TPM.txt


2. Alignment statistics

Alignment statistics help answer three questions:
1) Have you sequenced enough reads? Usually, >25M mapped reads and >10M uniquely mapped reads are expected;
2) Have your reads been over-trimmed? Usually, a Percentage of Filtered Reads >90% is expected. Over-trimming can happen when the insert size is too small or the wrong Phred score encoding was selected;
3) Are your samples contaminated? Usually, a Percentage of Mapped Reads >75% is expected. The most common cause of a low mapping rate is contamination with either DNA or rRNA.

Sample Name | Number of Total Reads | Number of Filtered Reads | Percentage of Filtered Reads | Number of Mapped Reads | Percentage of Mapped Reads | Number of Uniquely-mapped Reads | Percentage of Uniquely-mapped Reads
sample1 | 19410373 | 19163544 | 98.73% | 15836491 | 82.64% | 14663015 | 92.59%
sample3 | 19410373 | 19163544 | 98.73% | 15836491 | 82.64% | 14663015 | 92.59%
* Percentage of Uniquely-mapped Reads = Number of Uniquely-mapped Reads / Number of Mapped Reads
* Percentage of Mapped Reads = Number of Mapped Reads / Number of Filtered Reads
* For paired-end libraries, the numbers in the tables count read pairs.
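
The percentages and expectations above can be checked with a few lines of code. The following is a minimal sketch, not the pipeline's own implementation: the function name `alignment_qc` and the check labels are hypothetical, and it assumes that Percentage of Filtered Reads = Number of Filtered Reads / Number of Total Reads, which the footnotes do not state explicitly.

```python
def alignment_qc(total, filtered, mapped, unique):
    """Recompute the alignment percentages and compare them with the usual expectations.

    Hypothetical helper for illustration only; thresholds are the ones quoted in this report.
    """
    # Assumption: filtered percentage is relative to the total number of reads.
    pct_filtered = 100.0 * filtered / total
    # From the footnotes: mapped percentage is relative to filtered reads.
    pct_mapped = 100.0 * mapped / filtered
    # From the footnotes: uniquely mapped percentage is relative to mapped reads.
    pct_unique = 100.0 * unique / mapped

    checks = {
        "enough_mapped_reads": mapped > 25_000_000,
        "enough_unique_reads": unique > 10_000_000,
        "not_over_trimmed": pct_filtered > 90.0,
        "mapping_rate_ok": pct_mapped > 75.0,
    }
    return {
        "pct_filtered": pct_filtered,
        "pct_mapped": pct_mapped,
        "pct_unique": pct_unique,
        "checks": checks,
    }


if __name__ == "__main__":
    # Values taken from the sample1 row of the table above.
    result = alignment_qc(19410373, 19163544, 15836491, 14663015)
    for name, value in result.items():
        print(name, value)
```

With the sample1 values, the three recomputed percentages round to the figures in the table, while the >25M mapped reads expectation is not met (15,836,491 mapped reads).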


3. Quantification statistics

Quantification statistics help assess how accurate the quantification results are:
1) How many transcripts/genes were identified confidently? Usually, >65K transcripts and/or >15K genes are expected;
2) How accurate is the quantification? Usually, a correlation coefficient >0.85 between the RSEM and Salmon estimates is expected.

3.1 Transcript level

Sample Name | Identified by RSEM | Identified by Salmon | Identified by Both | Coef_Pearson | Pval_Pearson | rho_Spearman | Pval_Spearman
sample1 | 91688 | 87038 | 79984 | 0.9790 | 0 | 0.9699 | 0
sample3 | 91688 | 86994 | 79953 | 0.9788 | 0 | 0.9698 | 0
* Only co-identified transcripts/genes were used in correlation analysis.

3.2 Gene level

Sample Name | Identified by RSEM | Identified by Salmon | Identified by Both | Coef_Pearson | Pval_Pearson | rho_Spearman | Pval_Spearman
sample1 | 22114 | 22429 | 21363 | 0.9798 | 0 | 0.9877 | 0
sample3 | 22114 | 22426 | 21363 | 0.9796 | 0 | 0.9877 | 0
* Only co-identified transcripts/genes were used in correlation analysis.
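
The comparison summarized in the two tables above can be sketched as follows. This is an illustrative example under stated assumptions, not the pipeline's code: the function names are hypothetical, a feature is treated as "identified" when its value exceeds `min_value` (the actual criterion is not given in this report), and the correlations are computed on untransformed values (the pipeline may transform them first). P-values are omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def compare_quantifiers(rsem, salmon, min_value=0.0):
    """Count identified features per quantifier and correlate the co-identified ones.

    `rsem` and `salmon` map feature IDs to expression values for one sample.
    Needs at least two co-identified features with non-constant values.
    """
    rsem_ids = {k for k, v in rsem.items() if v > min_value}      # assumed criterion
    salmon_ids = {k for k, v in salmon.items() if v > min_value}  # assumed criterion
    both = sorted(rsem_ids & salmon_ids)
    x = [rsem[k] for k in both]
    y = [salmon[k] for k in both]
    return {
        "identified_by_rsem": len(rsem_ids),
        "identified_by_salmon": len(salmon_ids),
        "identified_by_both": len(both),
        "coef_pearson": pearson(x, y),
        "rho_spearman": spearman(x, y),
    }
```

Restricting the correlation to the intersection of the two sets mirrors the footnote: features reported by only one quantifier do not contribute to either coefficient.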


4. Biotype distribution

Biotype distribution shows the composition of biotypes among the identified transcripts and genes. Since different biotypes of transcripts/genes vary hugely in length, GC content and other properties, the biotype distribution can serve as a measure of quantification quality. Empirically:
1) For total RNA libraries, protein_coding transcripts and genes should account for >40% and >50%, respectively;
2) For mRNA libraries, protein_coding transcripts and genes should account for >80%.
For more details about the biotypes: https://www.gencodegenes.org/pages/biotypes.html.

4.1 Transcript level

sampleName | protein_coding | retained_intron | lncRNA | protein_coding_CDS_not_defined | nonsense_mediated_decay | Others
sample1 | 43.19% | 20.26% | 16.79% | 9.15% | 8.71% | 1.9%
sample3 | 43.19% | 20.26% | 16.79% | 9.15% | 8.71% | 1.9%
* Biotypes accounting for <1% were marked as ‘NA’ and merged into ‘Others’.

4.2 Gene level

sampleName | protein_coding | lncRNA | processed_pseudogene | TEC | Others
sample1 | 63.41% | 29.16% | 3.57% | 1.23% | 2.63%
sample3 | 63.41% | 29.16% | 3.57% | 1.23% | 2.63%
* Biotypes accounting for <1% were marked as ‘NA’ and merged into ‘Others’.
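
The way the percentages in these tables are built, including the merging of rare biotypes, can be sketched as below. This is a minimal illustration, not the pipeline's code: the function name `biotype_distribution` is hypothetical, and it assumes the input is one biotype label per identified transcript or gene.

```python
from collections import Counter


def biotype_distribution(biotypes, min_pct=1.0):
    """Return the percentage of each biotype, merging those below `min_pct` into 'Others'.

    `biotypes` is an iterable with one biotype label per identified feature
    (an assumed input format, for illustration only).
    """
    counts = Counter(biotypes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no identified features")

    distribution = {}
    others = 0.0
    # most_common() keeps the most abundant biotypes first, as in the tables.
    for biotype, n in counts.most_common():
        pct = 100.0 * n / total
        if pct < min_pct:
            others += pct  # rare biotype: merged into 'Others'
        else:
            distribution[biotype] = pct
    distribution["Others"] = others
    return distribution


if __name__ == "__main__":
    labels = ["protein_coding"] * 150 + ["lncRNA"] * 49 + ["TEC"] * 1
    for biotype, pct in biotype_distribution(labels).items():
        print(f"{biotype}\t{pct:.2f}%")
```

The resulting protein_coding percentage is the value to compare against the empirical expectations listed at the start of this section.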


5. Genebody coverage statistics

Genebody coverage statistics summarize the RNA-Seq read coverage over the gene body.
1) Mean of Coverage is the average coverage across all 100 bins of the gene body. Usually, >0.7 is expected.
2) Coefficient of Skewness is a measure of the asymmetry of the distribution of gene body coverage. Fisher’s moment coefficient of skewness is calculated by default. The closer the value is to 0, the more symmetric the coverage. For more details: https://en.wikipedia.org/wiki/Skewness.

sampleName | meanCoverage | coefSkewness
sample1 | 0.8449504 | 0.0756538
sample3 | 0.8449504 | 0.0756538
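
For readers who want to reproduce the two statistics, here is a minimal sketch. It is not the pipeline's implementation: the function name `genebody_stats` is hypothetical, and it assumes the input is the list of per-bin coverage values already scaled to the 0-1 range (the scaling method is not described in this report). Skewness is computed as Fisher's moment coefficient, g1 = m3 / m2^1.5, where m2 and m3 are the second and third central moments of the bin values.

```python
def genebody_stats(coverage):
    """Return (mean coverage, Fisher's moment coefficient of skewness).

    `coverage` holds one normalized coverage value per gene body bin
    (assumed input format, typically 100 bins).
    """
    n = len(coverage)
    if n == 0:
        raise ValueError("coverage must not be empty")
    mean = sum(coverage) / n
    # Second and third central moments (population form, divided by n).
    m2 = sum((c - mean) ** 2 for c in coverage) / n
    m3 = sum((c - mean) ** 3 for c in coverage) / n
    # A perfectly flat profile has zero variance; report zero skewness for it.
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return mean, skewness


if __name__ == "__main__":
    # Toy profile with 5 bins instead of 100, for illustration only.
    mean_cov, skew = genebody_stats([0.7, 0.8, 0.9, 0.8, 0.8])
    print(mean_cov, skew)
```

A positive coefficient indicates a longer tail of high-coverage bins and a negative one a longer tail of low-coverage bins, such as the drop toward one end of the gene body that degraded RNA can produce.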